-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Start poller after fast sync is completed #54
Conversation
@@ -97,6 +97,10 @@ func (cp *ChainPoller) Start(startHeight uint64) error { | |||
} | |||
|
|||
func (cp *ChainPoller) Stop() error { | |||
if !cp.IsRunning() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without this, TestOpFpNoVotingPower
will fail b/c it's trying to stop a poller that's not started
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this the same as !cp.isStarted.Swap(false)
? It's just nil is returned instead of an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, now TestOpFpNoVotingPower
will start the poller, it can be removed.
} | ||
|
||
fp.poller = poller | ||
// Start the poller if fast sync is disabled or there's no finalized block | ||
if (fp.cfg.FastSyncInterval == 0 || lastFinalizedBlock == nil) && !fp.poller.IsRunning() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move !fp.poller.IsRunning()
out to a new if condition? imo if !fp.poller.IsRunning()
then we need to return err
? Also if this if statement is false, when will the poller start?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this if
statement is false
, we start the poller after fast sync is finished.
// start poller after fast sync is finished
if !fp.poller.IsRunning() {
err := fp.poller.Start(res.LastProcessedHeight + 1)
if err != nil {
fp.logger.Error("failed to start the poller", zap.Error(err))
fp.reportCriticalErr(err)
}
continue
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we should add these check because up to here the fast sync is already finished and should be safe to start the poller. Can we have a workaround in e2e?
Thanks for the proposed fix! @gitferry what do you think about starting poller only after fast sync? actually I remember we had a discussion on it but somehow did not push it further Also, if this idea is approved, how about merging it to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous implementation, the poller is started after bootstrap
is finished which involves fast sync
, so I don't understand what specific case does this PR want to address. Did I miss anything?
@@ -97,6 +97,10 @@ func (cp *ChainPoller) Start(startHeight uint64) error { | |||
} | |||
|
|||
func (cp *ChainPoller) Stop() error { | |||
if !cp.IsRunning() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this the same as !cp.isStarted.Swap(false)
? It's just nil is returned instead of an error
@gitferry. ok we just took another look at the logs. so what happened was: after topping up BBN and restarting the FP,
then it will consider bootstrap complete and prints "the finality-provider has been bootstrapped" see relevant logs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good if @gitferry can take a final look 👍
} | ||
|
||
fp.poller = poller | ||
// Start the poller if fast sync is disabled or there's no finalized block | ||
if (fp.cfg.FastSyncInterval == 0 || lastFinalizedBlock == nil) && !fp.poller.IsRunning() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move !fp.poller.IsRunning()
out to a new if condition? imo if !fp.poller.IsRunning()
then we need to return err
? Also if this if statement is false, when will the poller start?
It seems that the issue is that fast sync reaches the latest height that pub rand is committed but still far from the tip height, so the fast sync got triggered frequently causing the poller stuck. If it is true then the issue should be why the pub rand is not caught up because we have a separate goroutine to commit pub rand continuously. |
@lesterli could you help take a look later today? i will be mostly in events today |
after checking the FP codes, the There is a similar case when fp was just started:
|
@gitferry in the |
@gitferry so the problem is fast sync cannot fast-sync pub randomnesses. and it doesn't make sense to fast-sync pub randomnesses either b/c iirc we need to wait for some time for a pub rand to be useful. so it seems we will still need this change |
Yes, we have the check at the beginning of |
cp.logger.Info("the poller is already started") | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change to return nil
? If so, it will hide the fact that the poller is called twice, which we do not expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for a mistake, this update is for the Stop
function to fix the failed e2e TestFastSync
.
} | ||
|
||
fp.poller = poller | ||
// Start the poller if fast sync is disabled or there's no finalized block | ||
if (fp.cfg.FastSyncInterval == 0 || lastFinalizedBlock == nil) && !fp.poller.IsRunning() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we should add these check because up to here the fast sync is already finished and should be safe to start the poller. Can we have a workaround in e2e?
if !fp.poller.IsRunning() { | ||
err := fp.poller.Start(res.LastProcessedHeight + 1) | ||
if err != nil { | ||
fp.logger.Error("failed to start the poller", zap.Error(err)) | ||
fp.reportCriticalErr(err) | ||
} | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the start of the poller should be controlled within the sig submission loop. The poller should be already started when the loop is on
we have a separate loop continuously committing pub rand so I think eventually fast-sync will go to the expected height |
right but the problem is bootstrap () will not wait for it. thus the edge case i.e. poller channel gets full still exists. and this PR tries to fix that |
ok we just realized that this PR still has some problems. i.e. pubrand is lagging a lot so fast sync exits early https://github.com/babylonlabs-io/finality-provider/compare/base/consumer-chain-support...fix/tmp-fix-poller?expand=1 this might be an easier fix to avoid all the headaches but still resolves #53 basically there are 3 ways I can think of:
|
another discussion topic is whether we should fast sync pub rand |
discussed w @gitferry.
|
created #77 |
closing as we have removed fast sync |
Summary
this is to fix #53
it also makes sense b/c fast sync will fetch the blocks anyway. so it's a waste of resources as well if there is another poller routine doing the same thing
Test Plan
tested this on GCP, waiting for the FP to run out of BBN. Then, after restarting FP, the sigs submission will work fine. https://gist.github.com/lesterli/830f6ec7cac7c8a9d43b251039d20c8b